Classification - Instacart Product Reorder Prediction
Predicting whether a customer will reorder products in their next order using the
Instacart Market Basket Analysis dataset.
Dataset Source: Kaggle Instacart Market Basket Analysis
Problem Type: Classification
Target Variable: Binary prediction of whether a product will be reordered (1) or not (0)
Use Case: E-commerce recommendation systems, inventory management, personalized marketing campaigns
Package Imports
!pip install xplainable
!pip install xplainable-client
import pandas as pd
import xplainable as xp
from xplainable.core.models import XClassifier
from xplainable.core.optimisation.bayesian import XParamOptimiser
from xplainable.preprocessing.pipeline import XPipeline
from xplainable.preprocessing import transformers as xtf
from sklearn.model_selection import train_test_split
import requests
import json
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import gc
Xplainable Cloud Setup
import xplainable_client

client = xplainable_client.Client(
    api_key="",  # paste your Xplainable Cloud API key here
)
Data Loading and Exploration
Load the Instacart Market Basket Analysis datasets.
Note: Download the datasets from Kaggle and extract the CSV files, or follow the
download instructions below.
Dataset Download Instructions
Option 1: Direct Download (Recommended for beginners)
- Visit the Kaggle Instacart Market Basket Analysis competition page
- Create a free Kaggle account if you don't have one
- Download the dataset ZIP file (~200MB)
- Extract all CSV files to your working directory
Option 2: Kaggle API (Recommended for experienced users)
pip install kaggle
kaggle competitions download -c instacart-market-basket-analysis
unzip instacart-market-basket-analysis.zip
Note: The dataset contains 6 CSV files with over 3 million orders and 32 million
order products.
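With roughly 3.4 million orders and 32 million prior order-product rows, memory can get tight on smaller machines. As an optional alternative to the plain read_csv calls below, here is a sketch of memory-leaner loading with explicit dtypes (column names follow the Instacart data dictionary; apply the same idea to the remaining files):
orders = pd.read_csv('orders.csv', dtype={
    'order_id': 'int32', 'user_id': 'int32', 'eval_set': 'category',
    'order_number': 'int16', 'order_dow': 'int8',
    'order_hour_of_day': 'int8', 'days_since_prior_order': 'float32',
})
order_products_prior = pd.read_csv('order_products__prior.csv', dtype={
    'order_id': 'int32', 'product_id': 'int32',
    'add_to_cart_order': 'int16', 'reordered': 'int8',
})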
try:
    orders = pd.read_csv('orders.csv')
    order_products_train = pd.read_csv('order_products__train.csv')
    order_products_prior = pd.read_csv('order_products__prior.csv')
    products = pd.read_csv('products.csv')
    aisles = pd.read_csv('aisles.csv')
    departments = pd.read_csv('departments.csv')

    print("Datasets loaded successfully!")
    print(f"Orders: {orders.shape}")
    print(f"Order Products (Train): {order_products_train.shape}")
    print(f"Order Products (Prior): {order_products_prior.shape}")
    print(f"Products: {products.shape}")
    print(f"Aisles: {aisles.shape}")
    print(f"Departments: {departments.shape}")
except FileNotFoundError as e:
    print(f"Error loading files: {e}")
    print("Please ensure you have downloaded and extracted the Instacart dataset files.")
    print("Files should be in the same directory as this notebook.")
Inspecting orders dataset
orders.head()
| | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
0 | 2539329 | 1 | prior | 1 | 2 | 8 | nan |
1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15 |
2 | 473747 | 1 | prior | 3 | 3 | 12 | 21 |
3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29 |
4 | 431534 | 1 | prior | 5 | 4 | 15 | 28 |
orders.info()
Out: <class 'pandas.core.frame.DataFrame'>
RangeIndex: 3421083 entries, 0 to 3421082
Data columns (total 7 columns):
# Column Dtype
--- ------ -----
0 order_id int64
1 user_id int64
2 eval_set object
3 order_number int64
4 order_dow int64
5 order_hour_of_day int64
6 days_since_prior_order float64
dtypes: float64(1), int64(5), object(1)
memory usage: 182.7+ MB
orders.isnull().sum()
Out: order_id 0
user_id 0
eval_set 0
order_number 0
order_dow 0
order_hour_of_day 0
days_since_prior_order 206209
dtype: int64
There are 206,209 missing values in the days_since_prior_order column: one per user, corresponding to each user's first order.
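These missing values are structural rather than data-quality problems: days_since_prior_order is undefined for a user's first order, and 206,209 matches the number of unique users. A quick sanity check:
# Every first order has a missing days_since_prior_order -- one per user
assert orders.loc[orders.order_number == 1, 'days_since_prior_order'].isna().all()
print(orders.user_id.nunique())  # 206209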
Inspecting order_products_train dataset
order_products_train.head()
| | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
0 | 1 | 49302 | 1 | 1 |
1 | 1 | 11109 | 2 | 1 |
2 | 1 | 10246 | 3 | 0 |
3 | 1 | 49683 | 4 | 0 |
4 | 1 | 43633 | 5 | 1 |
order_products_train.shape
order_products_train.isnull().sum()
Out: order_id 0
product_id 0
add_to_cart_order 0
reordered 0
dtype: int64
Inspecting order_products_prior dataset
order_products_prior.head()
| | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
0 | 2 | 33120 | 1 | 1 |
1 | 2 | 28985 | 2 | 1 |
2 | 2 | 9327 | 3 | 0 |
3 | 2 | 45918 | 4 | 1 |
4 | 2 | 30035 | 5 | 0 |
order_products_prior.shape
order_products_prior.isnull().sum()
Out: order_id 0
product_id 0
add_to_cart_order 0
reordered 0
dtype: int64
Inspecting products dataset
products.head()
| | product_id | product_name | aisle_id | department_id |
|---|---|---|---|---|
0 | 1 | Chocolate Sandwich Cookies | 61 | 19 |
1 | 2 | All-Seasons Salt | 104 | 13 |
2 | 3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 |
3 | 4 | Smart Ones Classic Favorites Mini Rigatoni Wit... | 38 | 1 |
4 | 5 | Green Chile Anytime Sauce | 5 | 13 |
products.isnull().sum()
Out: product_id 0
product_name 0
aisle_id 0
department_id 0
dtype: int64
Inspecting aisles dataset
aisles.head()
| | aisle_id | aisle |
|---|---|---|
0 | 1 | prepared soups salads |
1 | 2 | specialty cheeses |
2 | 3 | energy granola bars |
3 | 4 | instant foods |
4 | 5 | marinades meat preparation |
aisles.isnull().sum()
Out: aisle_id 0
aisle 0
dtype: int64
Inspecting departments dataset
departments.head()
| | department_id | department |
|---|---|---|
0 | 1 | frozen |
1 | 2 | other |
2 | 3 | bakery |
3 | 4 | produce |
4 | 5 | alcohol |
departments.isnull().sum()
Out: department_id 0
department 0
dtype: int64
Exploratory Data Analysis (EDA)
plt.figure(figsize=(6,4))
color = sns.color_palette()  # palette reused across the count plots below
sns.countplot(x="order_dow", data=orders, color=color[0])
plt.ylabel('Count', fontsize=12)
plt.xlabel('Day of week', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Orders by week day", fontsize=15)
plt.show()

Order volume is highest on days 0 and 1 (generally taken to be the weekend), suggesting
people do most of their grocery shopping when they are at home.
plt.figure(figsize=(6,4))
sns.countplot(x="order_hour_of_day", data=orders, color=color[0])
plt.ylabel('Count', fontsize=12)
plt.xlabel('Hour of day', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Orders by Hour of day", fontsize=15)
plt.show()

Peak ordering hours are between 9 AM and 5 PM; few orders are placed before 7 AM or
after 11 PM.
plt.figure(figsize=(10,6))
sns.countplot(x="days_since_prior_order", data=orders)
plt.xticks(rotation=90)
plt.show()
The most common gap between orders is 30 days (the column is capped at 30), with 7 days
the second most common, suggesting strong monthly and weekly ordering habits.
products_details = pd.merge(left=products, right=departments, how="left", on="department_id")
products_details = pd.merge(left=products_details, right=aisles, how="left", on="aisle_id")
products_details.head()
| | product_id | product_name | aisle_id | department_id | department | aisle |
|---|---|---|---|---|---|---|
0 | 1 | Chocolate Sandwich Cookies | 61 | 19 | snacks | cookies cakes |
1 | 2 | All-Seasons Salt | 104 | 13 | pantry | spices seasonings |
2 | 3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 | beverages | tea |
3 | 4 | Smart Ones Classic Favorites Mini Rigatoni Wit... | 38 | 1 | frozen | frozen meals |
4 | 5 | Green Chile Anytime Sauce | 5 | 13 | pantry | marinades meat preparation |
plt.figure(figsize=(10,6))
g=sns.countplot(x="department",data=products_details)
g.set_xticklabels(g.get_xticklabels(), rotation=40, ha="right")
plt.show()

Personal care is the department with the most distinct products, followed by snacks.
plt.figure(figsize=(10,6))
products_details["aisle"].value_counts()[:10].plot(kind="bar", title="Aisles")
plt.show()

The placeholder aisle "missing" contains the most products, indicating many products have no aisle assignment.
order_products_name_train = pd.merge(left=order_products_train, right=products.loc[:,["product_id","product_name"]], on="product_id", how="left")
common_Products = order_products_name_train[order_products_name_train.reordered == 1]["product_name"].value_counts().to_frame().reset_index()
common_Products.columns = ["product_name", "count"]  # normalise column names across pandas versions
plt.figure(figsize=(12,7))
sns.barplot(x="count", y="product_name", data=common_Products.head(10))
plt.ylabel('product_name', fontsize=12)
plt.xlabel('count', fontsize=12)
plt.show()

Banana is the most frequently reordered product, followed by Bag of Organic Bananas.
order_products_name_train = pd.merge(left=order_products_name_train, right=products_details.loc[:,["product_id","aisle","department"]], on="product_id", how="left")
common_aisle = order_products_name_train["aisle"].value_counts().to_frame().reset_index()
common_aisle.columns = ["aisle", "count"]  # normalise column names across pandas versions
plt.figure(figsize=(12,7))
sns.barplot(x="count", y="aisle", data=common_aisle.head(10), palette="Blues_d")
plt.ylabel('aisle', fontsize=12)
plt.xlabel('count', fontsize=12)
plt.show()

The fresh vegetables aisle has the most purchases, followed by fresh fruits.
common_aisle = order_products_name_train["department"].value_counts().to_frame().reset_index()
common_aisle.columns = ["department", "count"]
plt.figure(figsize=(12,7))
sns.barplot(x="count", y="department", data=common_aisle, palette="Blues_d")
plt.ylabel('department', fontsize=12)
plt.xlabel('count', fontsize=12)
plt.show()

Produce and dairy eggs are the two departments with the most purchases.
train_data_reordered = order_products_train.groupby(["order_id","reordered"])["product_id"].apply(list).reset_index()
train_data_reordered = train_data_reordered[train_data_reordered.reordered == 1].drop(columns=["reordered"]).reset_index(drop=True)
train_data_reordered.head()
| | order_id | product_id |
|---|---|---|
0 | 1 | [49302, 11109, 43633, 22035] |
1 | 36 | [19660, 43086, 46620, 34497, 48679, 46979] |
2 | 38 | [21616] |
3 | 96 | [20574, 40706, 27966, 24489, 39275] |
4 | 98 | [8859, 19731, 43654, 13176, 4357, 37664, 34065... |
1. Data Preprocessing
Feature Engineering and Data Preparation
Create features for user behavior, product characteristics, and user-product
interactions.
del products_details
del order_products_name_train
del common_Products
del common_aisle
del train_data_reordered
gc.collect()
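# Keep a 15% random sample of users so the feature engineering fits in memory; adjust frac as needed.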
orders = orders.loc[orders.user_id.isin(orders.user_id.drop_duplicates().sample(frac=0.15, random_state=101))]
aisles['aisle'] = aisles['aisle'].astype('category')
departments['department'] = departments['department'].astype('category')
orders['eval_set'] = orders['eval_set'].astype('category')
products['product_name'] = products['product_name'].astype('category')
prior_orders = pd.merge(orders, order_products_prior, on='order_id', how='inner')
prior_orders.head()
| | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|---|---|---|---|---|---|
0 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 196 | 1 | 0 |
1 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 14084 | 2 | 0 |
2 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 12427 | 3 | 0 |
3 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 26088 | 4 | 0 |
4 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 26405 | 5 | 0 |
Create features using user_id
users = prior_orders.groupby(by='user_id')['order_number'].aggregate('max').to_frame('num_of_orders_for_each_user').reset_index()
users.head()
| | user_id | num_of_orders_for_each_user |
|---|---|---|
0 | 1 | 10 |
1 | 2 | 14 |
2 | 3 | 12 |
3 | 4 | 5 |
4 | 5 | 4 |
total_product_per_order = prior_orders.groupby(by=['user_id', 'order_id'])['product_id'].aggregate('count').to_frame('total_products_per_order').reset_index()
avg_number_of_products_per_order = total_product_per_order.groupby(by=['user_id'])['total_products_per_order'].mean().to_frame('avg_no_prd_per_order').reset_index()
del total_product_per_order
gc.collect()
avg_number_of_products_per_order.head()
| | user_id | avg_no_prd_per_order |
|---|---|---|
0 | 1 | 5.9 |
1 | 2 | 13.9286 |
2 | 3 | 7.33333 |
3 | 4 | 3.6 |
4 | 5 | 9.25 |
from scipy import stats

def calculate_mode(x):
    # Return the modal value, handling scipy versions that return arrays vs scalars
    if len(x) > 0:
        mode_result = stats.mode(x)
        if isinstance(mode_result.mode, np.ndarray) and mode_result.mode.size > 0:
            return mode_result.mode[0]
        return mode_result.mode
    return pd.NA
order_most_dow = prior_orders.groupby(by=['user_id'])['order_dow'].aggregate(calculate_mode).to_frame('dow_with_most_orders').reset_index()
order_most_dow.head()
| | user_id | dow_with_most_orders |
|---|---|---|
0 | 1 | 4 |
1 | 2 | 2 |
2 | 3 | 0 |
3 | 4 | 4 |
4 | 5 | 3 |
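For reference, the same statistic can be computed without scipy using pandas' own mode (a sketch; Series.mode returns ties sorted ascending, so tie-breaking may differ slightly from scipy):
order_most_dow_alt = (
    prior_orders.groupby('user_id')['order_dow']
    .agg(lambda s: s.mode().iloc[0])
    .to_frame('dow_with_most_orders')
    .reset_index()
)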
# calculate_mode works on any series, so we reuse it for the hour of day
order_most_hod = prior_orders.groupby(by=['user_id'])['order_hour_of_day'].aggregate(calculate_mode).to_frame('hod_with_most_orders').reset_index()
order_most_hod.head()
| | user_id | hod_with_most_orders |
|---|---|---|
0 | 1 | 7 |
1 | 2 | 9 |
2 | 3 | 16 |
3 | 4 | 15 |
4 | 5 | 18 |
user_reorder_ratio = prior_orders.groupby(by='user_id')['reordered'].aggregate('mean').to_frame('reorder_ratio').reset_index()
user_reorder_ratio['reorder_ratio'] = user_reorder_ratio['reorder_ratio'].astype(np.float16)
user_reorder_ratio.head()
| | user_id | reorder_ratio |
|---|---|---|
0 | 1 | 0.694824 |
1 | 2 | 0.476807 |
2 | 3 | 0.625 |
3 | 4 | 0.055542 |
4 | 5 | 0.378418 |
users = users.merge(avg_number_of_products_per_order, on='user_id', how='left')
users = users.merge(order_most_dow, on='user_id', how='left')
users = users.merge(order_most_hod, on='user_id', how='left')
users = users.merge(user_reorder_ratio, on='user_id', how='left')
users.head()
| | user_id | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio |
|---|---|---|---|---|---|---|---|
0 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 |
1 | 2 | 14 | 13.9286 | 13.9286 | 2 | 9 | 0.476807 |
2 | 3 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 |
3 | 4 | 5 | 3.6 | 3.6 | 4 | 15 | 0.055542 |
4 | 5 | 4 | 9.25 | 9.25 | 3 | 18 | 0.378418 |
del [avg_number_of_products_per_order,order_most_dow,order_most_hod,user_reorder_ratio]
gc.collect()
Create features using product_id.
purchased_num_of_times = prior_orders.groupby(by='product_id')['order_id'].aggregate('count').to_frame('purchased_num_of_times').reset_index()
purchased_num_of_times.head()
| | product_id | purchased_num_of_times |
|---|---|---|
0 | 1 | 1852 |
1 | 2 | 90 |
2 | 3 | 277 |
3 | 4 | 329 |
4 | 5 | 15 |
product_reorder_ratio = prior_orders.groupby(by='product_id')['reordered'].aggregate('mean').to_frame('product_reorder_ratio').reset_index()
product_reorder_ratio.head()
| | product_id | product_reorder_ratio |
|---|---|---|
0 | 1 | 0.613391 |
1 | 2 | 0.133333 |
2 | 3 | 0.732852 |
3 | 4 | 0.446809 |
4 | 5 | 0.6 |
add_to_cart = prior_orders.groupby(by='product_id')['add_to_cart_order'].aggregate('mean').to_frame('product_avg_cart_addition').reset_index()
add_to_cart.head()
| | product_id | product_avg_cart_addition |
|---|---|---|
0 | 1 | 5.80184 |
1 | 2 | 9.88889 |
2 | 3 | 6.41516 |
3 | 4 | 9.5076 |
4 | 5 | 6.46667 |
purchased_num_of_times = purchased_num_of_times.merge(product_reorder_ratio, on='product_id', how='left')
purchased_num_of_times = purchased_num_of_times.merge(add_to_cart, on='product_id', how='left')
del [product_reorder_ratio, add_to_cart]
gc.collect()
purchased_num_of_times.head()
| | product_id | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition |
|---|---|---|---|---|
0 | 1 | 1852 | 0.613391 | 5.80184 |
1 | 2 | 90 | 0.133333 | 9.88889 |
2 | 3 | 277 | 0.732852 | 6.41516 |
3 | 4 | 329 | 0.446809 | 9.5076 |
4 | 5 | 15 | 0.6 | 6.46667 |
Creating features using user_id and product_id
user_product_data = prior_orders.groupby(by=['user_id', 'product_id'])['order_id'].aggregate('count').to_frame('uxp_times_bought').reset_index()
user_product_data.head()
| | user_id | product_id | uxp_times_bought |
|---|---|---|---|
0 | 1 | 196 | 10 |
1 | 1 | 10258 | 9 |
2 | 1 | 10326 | 1 |
3 | 1 | 12427 | 10 |
4 | 1 | 13032 | 3 |
product_first_order_num = prior_orders.groupby(by=['user_id', 'product_id'])['order_number'].aggregate('min').to_frame('first_order_number').reset_index()
product_first_order_num.head()
| | user_id | product_id | first_order_number |
|---|---|---|---|
0 | 1 | 196 | 1 |
1 | 1 | 10258 | 2 |
2 | 1 | 10326 | 5 |
3 | 1 | 12427 | 1 |
4 | 1 | 13032 | 2 |
total_orders = prior_orders.groupby('user_id')['order_number'].max().to_frame('total_orders').reset_index()
total_orders.head()
| | user_id | total_orders |
|---|---|---|
0 | 1 | 10 |
1 | 2 | 14 |
2 | 3 | 12 |
3 | 4 | 5 |
4 | 5 | 4 |
user_product_df = pd.merge(total_orders, product_first_order_num, on='user_id', how='right')
user_product_df.head()
| | user_id | total_orders | product_id | first_order_number |
|---|---|---|---|---|
0 | 1 | 10 | 196 | 1 |
1 | 1 | 10 | 10258 | 2 |
2 | 1 | 10 | 10326 | 5 |
3 | 1 | 10 | 12427 | 1 |
4 | 1 | 10 | 13032 | 2 |
user_product_df['order_range'] = user_product_df['total_orders'] - user_product_df['first_order_number'] + 1
user_product_df.head()
| | user_id | total_orders | product_id | first_order_number | order_range |
|---|---|---|---|---|---|
0 | 1 | 10 | 196 | 1 | 10 |
1 | 1 | 10 | 10258 | 2 | 9 |
2 | 1 | 10 | 10326 | 5 | 6 |
3 | 1 | 10 | 12427 | 1 | 10 |
4 | 1 | 10 | 13032 | 2 | 9 |
number_of_times = prior_orders.groupby(by=['user_id', 'product_id'])['order_id'].aggregate('count').to_frame('times_bought').reset_index()
number_of_times.head()
| | user_id | product_id | times_bought |
|---|---|---|---|
0 | 1 | 196 | 10 |
1 | 1 | 10258 | 9 |
2 | 1 | 10326 | 1 |
3 | 1 | 12427 | 10 |
4 | 1 | 13032 | 3 |
uxp_ratio = pd.merge(number_of_times, user_product_df, on=['user_id', 'product_id'], how='left')
uxp_ratio.head()
| | user_id | product_id | times_bought | total_orders | first_order_number | order_range |
|---|---|---|---|---|---|---|
0 | 1 | 196 | 10 | 10 | 1 | 10 |
1 | 1 | 10258 | 9 | 10 | 2 | 9 |
2 | 1 | 10326 | 1 | 10 | 5 | 6 |
3 | 1 | 12427 | 10 | 10 | 1 | 10 |
4 | 1 | 13032 | 3 | 10 | 2 | 9 |
uxp_ratio['uxp_reorder_ratio'] = uxp_ratio['times_bought'] / uxp_ratio['order_range']
uxp_ratio.head()
| | user_id | product_id | times_bought | total_orders | first_order_number | order_range | uxp_reorder_ratio |
|---|---|---|---|---|---|---|---|
0 | 1 | 196 | 10 | 10 | 1 | 10 | 1 |
1 | 1 | 10258 | 9 | 10 | 2 | 9 | 1 |
2 | 1 | 10326 | 1 | 10 | 5 | 6 | 0.166667 |
3 | 1 | 12427 | 10 | 10 | 1 | 10 | 1 |
4 | 1 | 13032 | 3 | 10 | 2 | 9 | 0.333333 |
uxp_ratio.drop(['times_bought', 'total_orders', 'first_order_number', 'order_range'], axis=1, inplace=True)
uxp_ratio.head()
| | user_id | product_id | uxp_reorder_ratio |
|---|---|---|---|
0 | 1 | 196 | 1 |
1 | 1 | 10258 | 1 |
2 | 1 | 10326 | 0.166667 |
3 | 1 | 12427 | 1 |
4 | 1 | 13032 | 0.333333 |
user_product_data = user_product_data.merge(uxp_ratio, on=['user_id', 'product_id'], how='left')
del [product_first_order_num, number_of_times,user_product_df,total_orders, uxp_ratio]
gc.collect()
user_product_data.head()
| | user_id | product_id | uxp_times_bought | uxp_reorder_ratio |
|---|---|---|---|---|
0 | 1 | 196 | 10 | 1 |
1 | 1 | 10258 | 9 | 1 |
2 | 1 | 10326 | 1 | 0.166667 |
3 | 1 | 12427 | 10 | 1 |
4 | 1 | 13032 | 3 | 0.333333 |
prior_orders['order_number_back'] = prior_orders.groupby(by=['user_id'])['order_number'].transform('max') - prior_orders.order_number + 1
prior_orders.head()
| | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order | product_id | add_to_cart_order | reordered | order_number_back |
|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 196 | 1 | 0 | 10 |
1 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 14084 | 2 | 0 | 10 |
2 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 12427 | 3 | 0 | 10 |
3 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 26088 | 4 | 0 | 10 |
4 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 26405 | 5 | 0 | 10 |
temp_df = prior_orders.loc[prior_orders.order_number_back <= 3]
temp_df.head()
| | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order | product_id | add_to_cart_order | reordered | order_number_back |
|---|---|---|---|---|---|---|---|---|---|---|---|
38 | 3108588 | 1 | prior | 8 | 1 | 14 | 14 | 12427 | 1 | 1 | 3 |
39 | 3108588 | 1 | prior | 8 | 1 | 14 | 14 | 196 | 2 | 1 | 3 |
40 | 3108588 | 1 | prior | 8 | 1 | 14 | 14 | 10258 | 3 | 1 | 3 |
41 | 3108588 | 1 | prior | 8 | 1 | 14 | 14 | 25133 | 4 | 1 | 3 |
42 | 3108588 | 1 | prior | 8 | 1 | 14 | 14 | 46149 | 5 | 0 | 3 |
last_three_order = temp_df.groupby(by=['user_id', 'product_id'])['order_id'].aggregate('count').to_frame('uxp_last_three').reset_index()
last_three_order.head()
| | user_id | product_id | uxp_last_three |
|---|---|---|---|
0 | 1 | 196 | 3 |
1 | 1 | 10258 | 3 |
2 | 1 | 12427 | 3 |
3 | 1 | 13032 | 1 |
4 | 1 | 25133 | 3 |
last_three_order['uxp_ratio_last_three'] = last_three_order['uxp_last_three'] / 3
last_three_order.head()
| | user_id | product_id | uxp_last_three | uxp_ratio_last_three |
|---|---|---|---|---|
0 | 1 | 196 | 3 | 1 |
1 | 1 | 10258 | 3 | 1 |
2 | 1 | 12427 | 3 | 1 |
3 | 1 | 13032 | 1 | 0.333333 |
4 | 1 | 25133 | 3 | 1 |
user_product_data = user_product_data.merge(last_three_order, on=['user_id', 'product_id'], how='left')
del [last_three_order, temp_df]
gc.collect()
user_product_data.head()
| | user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three |
|---|---|---|---|---|---|---|
0 | 1 | 196 | 10 | 1 | 3 | 1 |
1 | 1 | 10258 | 9 | 1 | 3 | 1 |
2 | 1 | 10326 | 1 | 0.166667 | nan | nan |
3 | 1 | 12427 | 10 | 1 | 3 | 1 |
4 | 1 | 13032 | 3 | 0.333333 | 1 | 0.333333 |
user_product_data.isnull().sum()
Out: user_id 0
product_id 0
uxp_times_bought 0
uxp_reorder_ratio 0
uxp_last_three 8382738
uxp_ratio_last_three 8382738
dtype: int64
user_product_data.fillna(0, inplace=True)
user_product_data.isnull().sum()
Out: user_id 0
product_id 0
uxp_times_bought 0
uxp_reorder_ratio 0
uxp_last_three 0
uxp_ratio_last_three 0
dtype: int64
Create final dataframe for engineered features
featured_engineered_data = user_product_data.merge(users, on='user_id', how='left')
featured_engineered_data = featured_engineered_data.merge(purchased_num_of_times, on='product_id', how='left')
del [users, user_product_data, purchased_num_of_times]
gc.collect()
featured_engineered_data.head()
| | user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 196 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 35791 | 0.77648 | 3.72177 |
1 | 1 | 10258 | 9 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 1946 | 0.713772 | 4.27749 |
2 | 1 | 10326 | 1 | 0.166667 | 0 | 0 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 5526 | 0.652009 | 4.1911 |
3 | 1 | 12427 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 6476 | 0.740735 | 4.76004 |
4 | 1 | 13032 | 3 | 0.333333 | 1 | 0.333333 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 3751 | 0.657158 | 5.62277 |
featured_engineered_data.isnull().sum()
Out: user_id 0
product_id 0
uxp_times_bought 0
uxp_reorder_ratio 0
uxp_last_three 0
uxp_ratio_last_three 0
num_of_orders_for_each_user 0
avg_no_prd_per_order_x 0
avg_no_prd_per_order_y 0
dow_with_most_orders 0
hod_with_most_orders 0
reorder_ratio 0
purchased_num_of_times 0
product_reorder_ratio 0
product_avg_cart_addition 0
dtype: int64
Creating Train and Test datasets
Create training dataset
orders_future = orders[((orders.eval_set=='train') | (orders.eval_set=='test'))]
orders_future = orders_future[['user_id', 'eval_set', 'order_id']]
final_data = featured_engineered_data.merge(orders_future, on='user_id', how='left')
final_data.head()
| | user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition | eval_set | order_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 196 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 35791 | 0.77648 | 3.72177 | train | 1187899 |
1 | 1 | 10258 | 9 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 1946 | 0.713772 | 4.27749 | train | 1187899 |
2 | 1 | 10326 | 1 | 0.166667 | 0 | 0 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 5526 | 0.652009 | 4.1911 | train | 1187899 |
3 | 1 | 12427 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 6476 | 0.740735 | 4.76004 | train | 1187899 |
4 | 1 | 13032 | 3 | 0.333333 | 1 | 0.333333 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 3751 | 0.657158 | 5.62277 | train | 1187899 |
train_data = final_data[final_data.eval_set=='train']
train_data.head()
| | user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition | eval_set | order_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 196 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 35791 | 0.77648 | 3.72177 | train | 1187899 |
1 | 1 | 10258 | 9 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 1946 | 0.713772 | 4.27749 | train | 1187899 |
2 | 1 | 10326 | 1 | 0.166667 | 0 | 0 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 5526 | 0.652009 | 4.1911 | train | 1187899 |
3 | 1 | 12427 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 6476 | 0.740735 | 4.76004 | train | 1187899 |
4 | 1 | 13032 | 3 | 0.333333 | 1 | 0.333333 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 3751 | 0.657158 | 5.62277 | train | 1187899 |
train_data = train_data.merge(order_products_train[['product_id', 'order_id', 'reordered']], on=['product_id', 'order_id'], how='left')
train_data.head()
| | user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition | eval_set | order_id | reordered |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 196 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 35791 | 0.77648 | 3.72177 | train | 1187899 | 1 |
1 | 1 | 10258 | 9 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 1946 | 0.713772 | 4.27749 | train | 1187899 | 1 |
2 | 1 | 10326 | 1 | 0.166667 | 0 | 0 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 5526 | 0.652009 | 4.1911 | train | 1187899 | nan |
3 | 1 | 12427 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 6476 | 0.740735 | 4.76004 | train | 1187899 | nan |
4 | 1 | 13032 | 3 | 0.333333 | 1 | 0.333333 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 3751 | 0.657158 | 5.62277 | train | 1187899 | 1 |
train_data.isnull().sum()
Out: user_id 0
product_id 0
uxp_times_bought 0
uxp_reorder_ratio 0
uxp_last_three 0
uxp_ratio_last_three 0
num_of_orders_for_each_user 0
avg_no_prd_per_order_x 0
avg_no_prd_per_order_y 0
dow_with_most_orders 0
hod_with_most_orders 0
reorder_ratio 0
purchased_num_of_times 0
product_reorder_ratio 0
product_avg_cart_addition 0
eval_set 0
order_id 0
reordered 7645837
dtype: int64
train_data['reordered'] = train_data['reordered'].fillna(0)
train_data.head()
| | user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition | eval_set | order_id | reordered |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 196 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 35791 | 0.77648 | 3.72177 | train | 1187899 | 1 |
1 | 1 | 10258 | 9 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 1946 | 0.713772 | 4.27749 | train | 1187899 | 1 |
2 | 1 | 10326 | 1 | 0.166667 | 0 | 0 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 5526 | 0.652009 | 4.1911 | train | 1187899 | 0 |
3 | 1 | 12427 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 6476 | 0.740735 | 4.76004 | train | 1187899 | 0 |
4 | 1 | 13032 | 3 | 0.333333 | 1 | 0.333333 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 3751 | 0.657158 | 5.62277 | train | 1187899 | 1 |
train_data = train_data.set_index(['user_id', 'product_id'])
train_data = train_data.drop(['eval_set', 'order_id'], axis=1)
train_data.head()
| user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition | reordered |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 196 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 35791 | 0.77648 | 3.72177 | 1 |
1 | 10258 | 9 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 1946 | 0.713772 | 4.27749 | 1 |
1 | 10326 | 1 | 0.166667 | 0 | 0 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 5526 | 0.652009 | 4.1911 | 0 |
1 | 12427 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 6476 | 0.740735 | 4.76004 | 0 |
1 | 13032 | 3 | 0.333333 | 1 | 0.333333 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 3751 | 0.657158 | 5.62277 | 1 |
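Before modelling, it is worth checking the class balance; the positive class is rare here (roughly 10% of user-product pairs, though the exact figure depends on the 15% user sample), which is why a low probability cutoff is used later in Model Testing instead of the default 0.5:
train_data['reordered'].value_counts(normalize=True)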
Create testing dataset
test_data = final_data[final_data.eval_set=='test']
test_data.head()
| | user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition | eval_set | order_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
120 | 3 | 248 | 1 | 0.090909 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 6371 | 0.400251 | 10.6208 | test | 2774568 |
121 | 3 | 1005 | 1 | 0.333333 | 1 | 0.333333 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 463 | 0.440605 | 9.49892 | test | 2774568 |
122 | 3 | 1819 | 3 | 0.333333 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 2424 | 0.492162 | 9.28754 | test | 2774568 |
123 | 3 | 7503 | 1 | 0.1 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 12474 | 0.553551 | 9.54738 | test | 2774568 |
124 | 3 | 8021 | 1 | 0.090909 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 27864 | 0.591157 | 8.82285 | test | 2774568 |
test_data = test_data.set_index(['user_id', 'product_id'])
test_data = test_data.drop(['eval_set', 'order_id'], axis=1)
test_data.head()
| user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | 248 | 1 | 0.090909 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 6371 | 0.400251 | 10.6208 |
3 | 1005 | 1 | 0.333333 | 1 | 0.333333 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 463 | 0.440605 | 9.49892 |
3 | 1819 | 3 | 0.333333 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 2424 | 0.492162 | 9.28754 |
3 | 7503 | 1 | 0.1 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 12474 | 0.553551 | 9.54738 |
3 | 8021 | 1 | 0.090909 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 27864 | 0.591157 | 8.82285 |
del [final_data, orders_future, products, order_products_train]
gc.collect()
Building model using Xplainable Classifier
Build X_train and y_train dataset
X_train, y_train = train_data.drop('reordered', axis=1), train_data.reordered
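The notebook fits and evaluates on the full training frame. train_test_split is already imported; if you would rather tune the probability threshold on held-out data, a minimal sketch:
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=101, stratify=y_train
)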
2. Model Optimization
Optimize hyperparameters using a subset of the data for computational efficiency.
from xplainable.core.optimisation.bayesian import XParamOptimiser
opt = XParamOptimiser()
params = opt.optimise(X_train[:1000000], y_train[:1000000])
Out: 100%|████████| 30/30 [00:45<00:00, 1.53s/trial, best loss: -0.8107380819617527]
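The optimiser above runs with its default settings. If you want to steer the search, the constructor arguments below follow xplainable's documented quickstart, but treat the parameter names as assumptions and verify them against your installed version:
# Parameter names assumed from the xplainable docs -- verify before relying on them
opt = XParamOptimiser(
    metric='roc-auc',    # objective to optimise
    n_trials=30,         # number of Bayesian optimisation trials
    n_folds=5,           # cross-validation folds per trial
    early_stopping=30,   # stop early if no improvement
)
params = opt.optimise(X_train[:1000000], y_train[:1000000])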
3. Model Training
Train the XClassifier with optimized parameters on the full training dataset.
from xplainable.core.models import XClassifier
model = XClassifier(**params)
model.fit(X_train, y_train)
Out: <xplainable.core.ml.classification.XClassifier at 0x2a426f760>
4. Model Interpretability and Explainability
Analyze feature importance and model decision-making for the product reorder
predictions.
In the Feature Importances view, uxp_reorder_ratio carries the highest importance,
indicating it is the most influential factor in the model's predictions.
In the Contributions view, green bars represent positive contributions to the predicted
probability and red bars negative ones; uxp_reorder_ratio again stands out, showing a
strong positive influence on the likelihood of a reorder.
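The views described above can be rendered directly from the fitted model; xplainable ships a built-in explainer for this (check the call against your installed version):
model.explain()  # renders the feature importance and contribution charts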
5. Model Persisting (Optional)
Save the trained model to the Xplainable platform for future use and deployment.
print("Model persistence step skipped - uncomment above code to save model")
6. Model Deployment (Optional)
Deploy the model to the Xplainable platform for real-time predictions.
print("Model deployment step skipped - uncomment above code to deploy model")
7. Model Testing
Create model predictions using a probability threshold cutoff.
NOTE: Adjust the threshold cutoff to see its impact on the results.
test_prediction = (model.predict_proba(test_data) >= 0.21).astype(int)
test_prediction[:5]
Out: array([0, 0, 0, 0, 0])
train_prediction = (model.predict_proba(X_train) >= 0.21).astype(int)
train_prediction[:5]
Out: array([1, 1, 0, 1, 0])
from sklearn.metrics import f1_score, classification_report
print(f'f1 Score: {f1_score(y_train, train_prediction)}')
print(classification_report(y_train, train_prediction))
Out: f1 Score: 0.41808008442097677
              precision    recall  f1-score   support

           0       0.94      0.92      0.93   7645837
           1       0.39      0.45      0.42    828824

    accuracy                           0.88   8474661
   macro avg       0.66      0.69      0.67   8474661
weighted avg       0.89      0.88      0.88   8474661
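Rather than guessing the 0.21 cutoff, you can sweep thresholds and keep the one that maximises F1 (a sketch scored on the training set; with a validation split, score on held-out data instead):
probs = model.predict_proba(X_train)
thresholds = np.arange(0.05, 0.55, 0.05)
scores = {round(t, 2): f1_score(y_train, (probs >= t).astype(int)) for t in thresholds}
best = max(scores, key=scores.get)
print(f"best threshold: {best} (F1 = {scores[best]:.4f})")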
test_data['prediction'] = test_prediction
test_data.head()
| user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition | prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | 248 | 1 | 0.090909 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 6371 | 0.400251 | 10.6208 | 0 |
3 | 1005 | 1 | 0.333333 | 1 | 0.333333 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 463 | 0.440605 | 9.49892 | 0 |
3 | 1819 | 3 | 0.333333 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 2424 | 0.492162 | 9.28754 | 0 |
3 | 7503 | 1 | 0.1 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 12474 | 0.553551 | 9.54738 | 0 |
3 | 8021 | 1 | 0.090909 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 27864 | 0.591157 | 8.82285 | 0 |
final_df = test_data.reset_index()
final_df = final_df[['product_id', 'user_id', 'prediction']]
gc.collect()
final_df.head()
| | product_id | user_id | prediction |
|---|---|---|---|
0 | 248 | 3 | 0 |
1 | 1005 | 3 | 0 |
2 | 1819 | 3 | 0 |
3 | 7503 | 3 | 0 |
4 | 8021 | 3 | 0 |
Creating the Kaggle submission file (optional)
After developing a robust model and ensuring its performance on our validation set, the
next step is to prepare our submission for Kaggle. Although this step is optional, it is
a good practice to understand how to create a submission file that adheres to the
competition's requirements.
To create a submission file, you typically need to:
- Ensure that your model has been trained on the full training set or with an appropriate cross-validation strategy.
- Generate predictions for the test set provided by Kaggle.
- Format these predictions into a CSV file that matches the competition's submission format; for this competition that means one row per order_id with a space-separated list of predicted product_ids (or 'None').
- Use pandas' to_csv() with the appropriate parameters, such as index=False if the index should not be included, to save your dataframe to a CSV file.
- Upload this CSV file to the Kaggle competition's submission page to see how your model performs on the unseen test set.
See the specific steps for the Kaggle upload below.
orders_test = orders.loc[orders.eval_set == 'test', ['user_id', 'order_id']]
orders_test.head()
| | user_id | order_id |
|---|---|---|
38 | 3 | 2.77457e+06 |
44 | 4 | 329954 |
53 | 6 | 1.52801e+06 |
96 | 11 | 1.37694e+06 |
102 | 12 | 1.35684e+06 |
final_df = final_df.merge(orders_test, on='user_id', how='left')
final_df.head()
| | product_id | user_id | prediction | order_id |
|---|---|---|---|---|
0 | 248 | 3 | 0 | 2.77457e+06 |
1 | 1005 | 3 | 0 | 2.77457e+06 |
2 | 1819 | 3 | 0 | 2.77457e+06 |
3 | 7503 | 3 | 0 | 2.77457e+06 |
4 | 8021 | 3 | 0 | 2.77457e+06 |
final_df = final_df.drop('user_id', axis=1)
final_df['product_id'] = final_df.product_id.astype(int)
final_df.head()
| | product_id | prediction | order_id |
|---|---|---|---|
0 | 248 | 0 | 2.77457e+06 |
1 | 1005 | 0 | 2.77457e+06 |
2 | 1819 | 0 | 2.77457e+06 |
3 | 7503 | 0 | 2.77457e+06 |
4 | 8021 | 0 | 2.77457e+06 |
final_dict = dict()
for row in final_df.itertuples():
    if row.prediction == 1:
        try:
            final_dict[row.order_id] += ' ' + str(row.product_id)
        except KeyError:
            final_dict[row.order_id] = str(row.product_id)

# Orders with no predicted reorders must still appear in the submission as 'None'
for order in final_df.order_id:
    if order not in final_dict:
        final_dict[order] = 'None'
gc.collect()
submission_df = pd.DataFrame.from_dict(final_dict, orient='index')
submission_df.reset_index(inplace=True)
submission_df.columns = ['order_id', 'products']
submission_df.head()
| | order_id | products |
|---|---|---|
0 | 2774568 | 17668 18599 21903 39190 43961 47766 |
1 | 1528013 | 21903 38293 |
2 | 1376945 | 8309 13176 14947 20383 27959 33572 35948 44632 |
3 | 1356845 | 5746 7076 8239 10863 11520 13176 14992 |
4 | 2161313 | 196 10441 11266 12427 14715 27839 37710 |
submission_df.to_csv('sub.csv', index=False, header=True)
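If you have the Kaggle CLI configured (see the download instructions earlier), the file can be submitted without leaving the notebook:
!kaggle competitions submit -c instacart-market-basket-analysis -f sub.csv -m "xplainable XClassifier baseline"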